70 research outputs found

    Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

    Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE.
    Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters rather than individual genes improves supervised classification accuracy on the same data, compared with SVM or Penalized Discriminant Analysis (PDA) combined with recursive feature elimination (SVM-RFE and PDA-RFE), which remove genes based on their individual discriminant weights.
    Conclusion: SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
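The cluster-then-eliminate loop described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Euclidean k-means rather than the correlation metric, a leave-one-out nearest-centroid classifier as a stand-in for the SVM scoring step, and it clusters only once. All function names and the toy data are hypothetical.

```python
from statistics import mean

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain k-means; returns the non-empty clusters as lists of row indices."""
    # Deterministic, evenly spaced initial centers (an illustrative choice).
    centers = [vectors[i * len(vectors) // k] for i in range(k)]
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    return [cl for cl in clusters if cl]

def loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `genes`.

    Assumption: this replaces the SVM used in the paper. Each class needs at
    least two samples so that a centroid exists after one is held out.
    """
    if not genes:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        centroids = {}
        for lab in sorted(set(labels)):
            rows = [s for j, (s, l) in enumerate(zip(samples, labels)) if j != i and l == lab]
            centroids[lab] = [mean(r[g] for r in rows) for g in genes]
        probe = [x[g] for g in genes]
        pred = min(centroids, key=lambda lab: sq_dist(probe, centroids[lab]))
        correct += pred == y
    return correct / len(samples)

def recursive_cluster_elimination(samples, labels, n_clusters, keep):
    """Cluster genes once, then repeatedly drop the least useful cluster."""
    n_genes = len(samples[0])
    profiles = [[s[g] for s in samples] for g in range(n_genes)]  # gene x sample
    clusters = kmeans(profiles, n_clusters)
    while len(clusters) > keep:
        current = [g for cl in clusters for g in cl]
        base = loo_accuracy(samples, labels, current)

        def contribution(cl):
            # Accuracy lost when this cluster's genes are taken away.
            rest = [g for g in current if g not in cl]
            return base - loo_accuracy(samples, labels, rest)

        clusters.remove(min(clusters, key=contribution))
    return clusters

# Toy data: genes 0 and 1 separate the classes, genes 2 and 3 do not.
samples = [
    [0.0, 0.5, 5.0, 1.0],
    [0.2, 0.1, 1.0, 5.0],
    [0.1, 0.3, 5.0, 1.0],
    [10.0, 9.5, 1.0, 5.0],
    [9.8, 10.2, 5.0, 1.0],
    [10.1, 9.9, 1.0, 5.0],
]
labels = ["A", "A", "A", "B", "B", "B"]
kept = recursive_cluster_elimination(samples, labels, n_clusters=2, keep=1)
print(kept)
```

Scoring a cluster by the accuracy lost when it is removed is one way to read "contribute the least to the classification performance"; the abstract does not state the exact scoring rule.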

    Learning from positive examples when the negative class is undetermined- microRNA gene identification

    Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designating the negative class can be problematic, and if it is not done properly it can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species.
    Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs.
    Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features chosen as representative of the positive class are well defined.
    Availability: The OneClassmiRNA program is available at [1].
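To make the one-class idea concrete, the sketch below fits a model on positive examples only and rejects anything that looks unlike them. It is an illustrative assumption, not the OneClassmiRNA program: the class name `OneClassGaussianNB`, the per-feature Gaussian model, the `reject_fraction` threshold rule, and the toy feature vectors are all hypothetical.

```python
import math
from statistics import mean, pstdev

class OneClassGaussianNB:
    """Naive Bayes style density model trained on the positive class alone."""

    def __init__(self, reject_fraction=0.1):
        # Fraction of training positives allowed to fall below the threshold.
        self.reject_fraction = reject_fraction

    def fit(self, positives):
        cols = list(zip(*positives))
        self.mu = [mean(c) for c in cols]
        # Floor the spread so constant features do not divide by zero.
        self.sigma = [max(pstdev(c), 1e-6) for c in cols]
        scores = sorted(self.score(x) for x in positives)
        self.threshold = scores[int(self.reject_fraction * len(scores))]
        return self

    def score(self, x):
        """Log-likelihood of x under independent per-feature Gaussians."""
        return sum(
            -0.5 * ((v - m) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))
            for v, m, s in zip(x, self.mu, self.sigma)
        )

    def predict(self, x):
        """True if x is accepted as a member of the positive class."""
        return self.score(x) >= self.threshold

# Toy positives clustered around (1.0, 2.0); no negative examples are needed.
positives = [
    [1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.2], [1.2, 2.0],
    [0.8, 1.8], [1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.0],
]
model = OneClassGaussianNB(reject_fraction=0.1).fit(positives)
print(model.predict([1.0, 2.0]), model.predict([5.0, 9.0]))
```

A two-class classifier would additionally need an artificial negative set at `fit` time, which is the step the abstract identifies as a source of bias.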

    Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes.

    Molecular stratification of tumors is essential for developing personalized therapies. Although patient stratification strategies have been successful, computational methods to accurately translate a gene signature from a high-throughput platform to a clinically adaptable low-dimensional platform are currently lacking. Here, we describe PIGExClass (platform-independent isoform-level gene-expression based classification-system), a novel computational approach to derive and then transfer gene signatures from one analytical platform to another. We applied PIGExClass to design a reverse transcriptase-quantitative polymerase chain reaction (RT-qPCR) based molecular-subtyping assay for glioblastoma multiforme (GBM), the most aggressive primary brain tumor. Unsupervised clustering of TCGA (The Cancer Genome Atlas Consortium) GBM samples, based on isoform-level gene-expression profiles, recaptured the four known molecular subgroups but switched the subtype for 19% of the samples, resulting in significant (P = 0.0103) survival differences among the refined subgroups. The PIGExClass-derived four-class classifier, which requires only 121 transcript variants, assigns GBM patients' molecular subtype with 92% accuracy. This classifier was translated to an RT-qPCR assay and validated in an independent cohort of 206 GBM samples. Our results demonstrate the efficacy of PIGExClass in the design of a clinically adaptable molecular subtyping assay and have implications for developing robust diagnostic assays for cancer patient stratification.

    Classification and Prediction of Survival in Patients with the Leukemic Phase of Cutaneous T Cell Lymphoma

    We have used cDNA arrays to investigate gene expression patterns in peripheral blood mononuclear cells from patients with leukemic forms of cutaneous T cell lymphoma, primarily Sezary syndrome (SS). When expression data for patients with high blood tumor burden (Sezary cells >60% of the lymphocytes) and healthy controls are compared by Student's t test, at P < 0.01, we find 385 genes to be differentially expressed. Highly overexpressed genes include Th2 cell–specific transcription factors Gata-3 and Jun B, as well as integrin β1, proteoglycan 2, the RhoB oncogene, and dual specificity phosphatase 1. Highly underexpressed genes include CD26, Stat-4, and the IL-1 receptors. Message for plastin-T, not normally expressed in lymphoid tissue, is detected only in patient samples and may provide a new marker for diagnosis. Using penalized discriminant analysis, we have identified a panel of eight genes that can distinguish SS in patients with as few as 5% circulating tumor cells. This suggests that, even in early disease, Sezary cells produce chemokines and cytokines that induce an expression profile in the peripheral blood distinctive to SS. Finally, we show that using 10 genes, we can identify a class of patients who will succumb within six months of sampling regardless of their tumor burden.

    Peripheral Immune Cell Gene Expression Predicts Survival of Patients with Non-Small Cell Lung Cancer

    Prediction of cancer recurrence in patients with non-small cell lung cancer (NSCLC) currently relies on the assessment of clinical characteristics including age, tumor stage, and smoking history. Better identification of early-stage patients with poorer survival and late-stage patients with better survival is needed to design patient-tailored treatment protocols. We analyzed gene expression in RNA from peripheral blood mononuclear cells (PBMC) of NSCLC patients to identify signatures predictive of overall patient survival. We find that PBMC gene expression patterns from NSCLC patients, like patterns from tumors, have information predictive of patient outcomes. We identify and validate a 26-gene prognostic panel that is independent of clinical stage. Many additional prognostic genes are specific to myeloid cells and are more highly expressed in patients with shorter survival. We also observe that significant numbers of prognostic genes change expression levels in PBMC collected after tumor resection. These post-surgery gene expression profiles may provide a means to re-evaluate prognosis over time. These studies further suggest that patient outcomes are not solely determined by tumor gene expression profiles but can also be influenced by the immune response as reflected in peripheral immune cells.

    PAF-R on activated T cells: Role in the IL-23/Th17 pathway and relevance to multiple sclerosis.

    IL-23 is a potent stimulus for Th17 cells. These cells have a distinct developmental pathway from Th1 cells induced by IL-12 and are implicated in autoimmune and inflammatory disorders including multiple sclerosis (MS). TGF-β, IL-6, and IL-1, the transcriptional regulator RORγt (RORC), and IL-23 are implicated in Th17 development and maintenance. In human polyclonally activated T cells, IL-23 enhances IL-17 production. The aims of our study were (1) to validate microarray results showing preferential expression of platelet activating factor receptor (PAF-R) on IL-23 stimulated T cells; (2) to determine whether PAF-R on activated T cells is functional, whether it is co-regulated with Th17-associated molecules, and whether it is implicated in Th17 function; and (3) to determine PAF-R expression in MS. We show that PAF-R is expressed on activated T cells, and is inducible by IL-23 and IL-17, which in turn are induced by PAF binding to PAF-R. PAF-R is co-expressed with IL-17 and regulated similarly with Th17 markers IL-17A, IL-17F, IL-22 and RORC. PAF-R is upregulated on PBMC and T cells of MS patients, and levels correlate with IL-17 and with MS disability scores. Our results show that PAF-R on T cells is associated with the Th17 phenotype and function. Clinical implications: Targeting PAF-R may interfere with Th17 function and offer therapeutic intervention in Th17-associated conditions, including MS.

    Classification and biomarker identification using gene network modules and support vector machines

    Background: Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high dimension data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination) suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes. We now demonstrate that an algorithm which integrates network information with recursive feature elimination based on SVM exhibits good performance and improves the biological interpretability of the results. We refer to the method as SVM with Recursive Network Elimination (SVM-RNE).
    Results: Initially, one thousand genes selected by t-test from a training set are filtered so that only genes that map to a gene network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form n clusters of genes that are highly connected in the network. Linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed while retaining the remainder for the next classification step. This process is repeated until an optimal classification is obtained.
    Conclusion: More than 90% accuracy can be obtained in classification of selected microarray datasets by integrating the interaction network information with the gene expression information from the microarrays. The Matlab version of SVM-RNE can be downloaded from http://web.macam.ac.il/~myousef

    A Novel Cross-Disciplinary Multi-Institute Approach to Translational Cancer Research: Lessons Learned from Pennsylvania Cancer Alliance Bioinformatics Consortium (PCABC)

    Background: The Pennsylvania Cancer Alliance Bioinformatics Consortium (PCABC, http://www.pcabc.upmc.edu) is one of the first major project-based initiatives stemming from the Pennsylvania Cancer Alliance and was funded for four years by the Department of Health of the Commonwealth of Pennsylvania. Its objective was to initiate a prototype biorepository and bioinformatics infrastructure with a robust data warehouse by developing (1) a statewide data model for bioinformatics and a repository of serum and tissue samples; (2) a data model for biomarker data storage; and (3) a public access website for disseminating research results and bioinformatics tools. The members of the Consortium cooperate closely, exploring the opportunity for sharing clinical, genomic and other bioinformatics data on patient samples in oncology, for the purpose of developing collaborative research programs across cancer research institutions in Pennsylvania. The Consortium's intention was to establish a virtual repository of many clinical specimens residing in various centers across the state, in order to make them available for research. One of our primary goals was to facilitate the identification of cancer specific biomarkers and encourage collaborative research efforts among the participating centers.
    Methods: The PCABC has developed unique partnerships so that every region of the state can effectively contribute and participate. It includes over 80 individuals from 14 organizations, and plans to expand to partners outside the State. This has created a network of researchers, clinicians, bioinformaticians, cancer registrars, program directors, and executives from academic and community health systems, as well as external corporate partners, all working together to accomplish a common mission. The various sub-committees have developed a common IRB protocol template, common data elements for standardizing data collections for three organ sites, intellectual property/tech transfer agreements, and material transfer agreements that have been approved by each of the member institutions. This foundational work has led to the development of a centralized data warehouse that meets each of the institutions' IRB/HIPAA standards.
    Results: Currently, this “virtual biorepository” has over 58,000 annotated samples from 11,467 cancer patients available for research purposes. The clinical annotation of tissue samples is done either manually over the internet or in semiautomated batch mode through mapping of local data elements to PCABC common data elements. The database currently holds information on 7188 cases (associated with 9278 specimens and 46,666 annotated blocks and blood samples) of prostate cancer, 2736 cases (associated with 3796 specimens and 9336 annotated blocks and blood samples) of breast cancer and 1543 cases (including 1334 specimens and 2671 annotated blocks and blood samples) of melanoma. These numbers continue to grow, and plans to integrate new tumor sites are in progress. Furthermore, the group has also developed a central web-based tool that allows investigators to share their translational (genomics/proteomics) experiment data on research evaluating potential biomarkers via a central location on the Consortium's web site.
    Conclusions: The technological achievements and the statewide informatics infrastructure that have been established by the Consortium will enable robust and efficient studies of biomarkers and their relevance to the clinical course of cancer. Studies resulting from the creation of the Consortium may allow for better classification of cancer types, more accurate assessment of disease prognosis, a better ability to identify the most appropriate individuals for clinical trial participation, and better surrogate markers of disease progression and/or response to therapy.

    An integrative ChIP-chip and gene expression profiling to model SMAD regulatory modules

    Background: The TGF-β/SMAD pathway is part of a broader signaling network in which crosstalk between pathways occurs. While the molecular mechanisms of the TGF-β/SMAD signaling pathway have been studied in detail, the global networks downstream of SMAD remain largely unknown. The regulatory effect of the SMAD complex likely depends on transcriptional modules, in which the SMAD binding elements and partner transcription factor binding sites (SMAD modules) are present in a specific context.
    Results: To address this question and develop a computational model for SMAD modules, we simultaneously performed chromatin immunoprecipitation followed by microarray analysis (ChIP-chip) and mRNA expression profiling to identify TGF-β/SMAD regulated and synchronously coexpressed gene sets in ovarian surface epithelium. Intersecting the ChIP-chip and gene expression data yielded 150 direct targets, of which 141 were grouped into 3 co-expressed gene sets (sustained up-regulated, transient up-regulated and down-regulated), based on their temporal changes in expression after TGF-β activation. We developed a data-mining method driven by the Random Forest algorithm to model SMAD transcriptional modules in the target sequences. The predicted SMAD modules contain a SMAD binding element and up to 2 of 7 other transcription factor binding sites (E2F, P53, LEF1, ELK1, COUPTF, PAX4 and DR1).
    Conclusion: Together, the computational results further the understanding of the interactions between SMAD and other transcription factors at specific target promoters, and provide the basis for more targeted experimental verification of the co-regulatory modules.

    Modulation of gene expression in heart and liver of hibernating black bears (Ursus americanus)

    Background: Hibernation is an adaptive strategy to survive in highly seasonal or unpredictable environments. The molecular and genetic basis of hibernation physiology in mammals has only recently been studied using large scale genomic approaches. We analyzed gene expression in the American black bear, Ursus americanus, using a custom 12,800 cDNA probe microarray to detect differences in expression that occur in heart and liver during winter hibernation in comparison to summer active animals.
    Results: We identified 245 genes in heart and 319 genes in liver that were differentially expressed between winter and summer. The expression of 24 genes was significantly elevated during hibernation in both heart and liver. These genes are mostly involved in lipid catabolism and protein biosynthesis and include RNA binding protein motif 3 (Rbm3), which enhances protein synthesis at mildly hypothermic temperatures. Elevated expression of protein biosynthesis genes suggests induction of translation that may be related to adaptive mechanisms reducing cardiac and muscle atrophies over extended periods of low metabolism and immobility during hibernation in bears. Coordinated reduction of transcription of genes involved in amino acid catabolism suggests redirection of amino acids from catabolic pathways to protein biosynthesis. We identify transcriptional changes in the liver that are common to black bears and small mammalian hibernators, including induction of genes responsible for fatty acid β-oxidation and carbohydrate synthesis and depression of genes involved in lipid biosynthesis, carbohydrate catabolism, cellular respiration and detoxification pathways.
    Conclusions: Our findings show that modulation of gene expression during winter hibernation represents a molecular mechanism of adaptation to extreme environments.